Abstract: Markov Decision Processes (MDPs) have been
used to formulate many decision-making problems in science 
and engineering. The objective is to synthesize the best decision 
(action selection) policies to maximize expected rewards (or 
minimize costs) in a given stochastic dynamical environment. In 
this paper, we extend this model by incorporating the additional
information that the transitions due to actions can be sequentially
observed. The proposed model benefits from this information 
and produces policies with better performance than those of 
standard MDPs. The paper also presents an efficient offline 
linear programming based algorithm to synthesize optimal 
policies for the extended model. 

I. Introduction 

Markov Decision Processes (MDPs) have been used to 
formulate many decision-making problems in a variety of 
areas of science and engineering [1]-[3]. MDPs have proved
useful in modeling decision-making problems for stochastic 
dynamical systems where the dynamics cannot be fully 
captured by using first principle formulations. MDP models 
can be constructed by utilizing the available measured data, 
which allows construction of state transition probabilities. 
Hence MDPs play a critical role in big-data analytics. 
Indeed, very popular machine learning methods such as
reinforcement learning and its variants [4], [5] are built upon the
MDP framework. With the increased interest and efforts in
Cyber-Physical Systems (CPS), there is even more interest in
MDPs to facilitate rigorous construction of innovative hierarchical
decision-making architectures, where the MDP framework
can integrate physics-based models with data-driven
models. Such decision architectures can utilize a systematic
approach to bring physical devices together with software 
to benefit many emerging engineering applications, such as 
autonomous systems. 

In many applications [6] [7], MDP models are used to 
compute optimal decisions when future actions contribute to 
the overall mission performance. Here we consider MDP- 
based stochastic decision-making models [8]. An MDP 
model is composed of a set of time instances (epochs), 
actions, states, and immediate rewards/costs. Actions transfer 
the system in a stochastic manner from one state to another 
and rewards are collected based on the actions taken at the 
corresponding states. Hence MDP models provide analytical 
descriptions of stochastic processes with state and action 
spaces, the state transition probabilities as a function of 
actions, and with rewards as a function of the states and 
actions. The objective is to design the best decision (action
selection) policies to maximize expected rewards (minimize 
costs) for a given MDP. 

With the advent of the Internet of Things (IoT) and the
increasing sensing capabilities, increasingly large amounts 
of data are collected. This paper aims to extend the typical 
MDP framework to exploit additional sensed information. In 
particular, we consider a scenario where not only the current 
state of the agent is known but also the transition due to an 
action can be observed in a sequential manner: the outcome
of action 1 is observed and a decision is made on whether
to take the action or not, and this process is continued until
one of the actions (in the given order) is taken. Decisions
are taken at instances called phases. A phase starts with an 
observation for the transition caused by an action and ends 
with a decision about whether to take this action or not. 

MDPs have been widely studied since the pioneering 
work of Bellman [9], which provided the foundation of 
dynamic programming, and the book of Howard [10] that 
popularized the study of decision processes. The standard 
MDP models are applied to diverse fields including robotics, 
automatic control, economics, manufacturing, and communication
networks. There have been several extensions and
generalizations of the MDP models to fit specific application 
requirements and considerations into the models. Typical 
MDP problems assume that at every decision epoch, agents 
know their current state and the reward for choosing an action,
while the environment is stochastic, i.e., the transitions
cannot be predicted in a deterministic manner. For example,
partially observed MDPs (POMDPs) extend the typical MDP
problems to take into account uncertainties in the agent state 
knowledge [11]. There can also be uncertainties in state 
transition/reward models. Learning methods are developed to 
handle such uncertainties (e.g., reinforcement learning [12]). 
In typical MDPs, decisions are taken on discrete epochs. 
Continuous-time MDPs [13] extend this model by relaxing 
the assumption of discrete events and models to continuous 
time and space models. Another extension is the Bandit 
problem [14], where the agents can observe the random 
reward of different actions and have to choose the actions that 
maximize the sum of rewards through a sequence of repeated 
experiments. In other decision-making problems, the determination
of an optimal stopping time is studied to find the optimal
epoch for a particular action [15, Chapter 13]. In other
applications, multi-objective cost functions or constraints are 
considered for the computation of the optimal MDP policies 
[16]. 

In most of the relevant literature, the extensions to the
standard MDP models are obtained by relaxing some of
its assumptions (like observability of the current state, known
rewards, transition probabilities, etc.). In this paper, however,
we extend typical MDP problems by considering a more
general model when more information about the environment
and the process is available. This latter assumption is motivated
by the fact that the evolving field of IoT is providing
agents with a lot of additional data that can be utilized in
the model to synthesize better decision-making strategies.
In particular, we assume that not only the current state, but
also the environmental transitions due to possible actions are
observed in a sequential manner. We aim to build decision-making
models that benefit from this class of information to
generate policies having better total expected rewards.

Fig. 1. The figure shows an example where sequential MDP models can be
applied. The vehicle knows the historical congestion data for three
separate routes to a common destination. But once the vehicle is at a
turn, it can observe the actual real-time congestion status of one route at
a time with a fixed sequence of observations and only knows the expected
congestion status of the upcoming routes. If the action to take the turn is
rejected (route not taken) the vehicle cannot come back.

II. Sequentially Observed MDP 
A. Examples 

This section presents several motivating examples for 
sequentially observed MDPs. 

1) Routing: Consider a vehicle that aims to go to a
final desired position (or a packet if a computer network
is considered) and there are three possible routes from the
current one-way street that the vehicle is on (Fig. 1). The
current street and the exits form the shape of the letter “E”, that
is, if the vehicle passes a turn then the corresponding route
is ruled out.

Each route can have congestion, for which there is prior 
knowledge based on historical data. The vehicle can only 
observe the current traffic conditions when it is at the 
turn. If the vehicle decides not to take one route based
on the observed congestion, it cannot get back later after
it observes the other route. The vehicle is forced to take
one of the routes, so if it rejects all observed congested
routes, it will be stuck with the last choice and must take
it regardless of that route's congestion status. The question
here is whether the vehicle should take a route given the
observed congestion and the historical data for the next turns.
Standard MDP models select beforehand the route having
the lowest average (expected) congestion regardless of the
observed route status.


2) University Admission: Suppose that a university has 
a certain number of scholarships for a program. Applicants 
are interviewed in a sequential manner on different selection 
rounds, i.e., in the first round the candidates are interviewed, 
evaluated, and admission decisions are announced before 
other candidates can apply in the second round. If a student 
is granted a scholarship, the available funding is decreased 
and the system changes its state. The rewards obtained are 
assumed to be the evaluation of the profile of the selected 
candidates assuming all applicants can be evaluated and 
compared by a scoring function. The committee knows what
the average score of applicants would be at different rounds
(based on prior data). Note that the evaluation committee can
observe the profile of candidates at a given round, but they
only know the average profile score of the next rounds. The
question in this scenario is: given the current (observed)
applicant pool and the expected pool for the next rounds, how
many of the applicants in this round should be accepted?
If we use a standard MDP model, then the solution would
be to select all candidates from the round with the highest
average. Clearly this solution is not practical in this scenario
because very important information is being discarded and
a better approach must be used.

3) Market Investment: Another possible application for
the sequential MDP model proposed in this paper is
market investment. Suppose that an investor has a certain
amount of resources to invest in an open market (a market
where prices change in a continuous manner, like currency
exchange). The investor knows the average price values
(for example low season and high season prices). However,
in a given period known to have high prices, the investor
observes that the market is announcing lower prices than
usual. Should the investor invest in that period or wait
for the next low season prices? Again, a typical MDP solution
would give beforehand policies that do not take into account
observed outcomes. An MDP solution in this scenario would
behave inefficiently.

B. Model 

The new sequentially observed MDP model has the following
components:

• The current state and the transition probabilities are
known, i.e., the probability of transitioning from any
state $i$ to another state $j$ when an action $a$ is taken.

• At a given decision epoch $t$, the agent observes the
possible next state if action $a_1$ were taken, but only
knows the transition probabilities for the rest of the
actions. The agent must either accept or reject the
transition due to $a_1$. Accepting the transition means the
agent chooses action $a_1$ at decision epoch $t$; rejecting the
transition means that the agent will not choose action
$a_1$ and the action must be chosen from the remaining
possible actions.

• Only after the rejection of $a_1$ can the agent observe
the deterministic transition if $a_2$ is taken, and it only
knows the probability of transitions for the remaining
actions. Again, accepting the transition means the agent
has chosen action $a_2$ at decision epoch $t$. Rejecting the
transition means that the agent will not choose action $a_1$
or $a_2$ and the action must be chosen from the remaining
possible actions.

• The procedure is repeated until action $a_{m-1}$. If
the observed transition due to $a_{m-1}$ is rejected, then
the agent has no choice and must choose $a_m$ (without
observing its corresponding transition). We say that the
system is at phase $k$ if the agent observes the transition
due to action $a_k$ and has not yet made a decision (to
reject or accept it).

• Once any action is taken (accepting an observed transition
or rejecting all observed transitions), the next
decision epoch starts.
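The phase-by-phase procedure listed above can be sketched as a small simulation of a single decision epoch. This is only an illustrative sketch: the function name `sequential_epoch`, the nested-list layout `G_i[k][j]` (action index first, 0-based) and the use of Python's `random` module are assumptions made here, not part of the model definition.

```python
import random


def sequential_epoch(G_i, P_i, rng):
    """Simulate one decision epoch from a fixed current state i.

    G_i[k][j]: probability that action k moves the system to state j
               (hypothetical layout, 0-based indices).
    P_i[k][j]: probability of accepting the observed transition to state j
               at phase k; only phases 0..m-2 carry a decision.
    Returns the pair (chosen_action, next_state).
    """
    m = len(G_i)
    n = len(G_i[0])
    for k in range(m):
        # Outcome that action k would produce; observed for k < m-1.
        next_state = rng.choices(range(n), weights=G_i[k])[0]
        # The last action is forced, so its outcome is simply realized.
        if k == m - 1 or rng.random() < P_i[k][next_state]:
            return k, next_state


# Usage: three actions, two states; accept action 0 only if it leads to state 0.
rng = random.Random(0)
G_i = [[0.5, 0.5], [0.2, 0.8], [1.0, 0.0]]
P_i = [[1.0, 0.0], [0.0, 0.0]]
action, state = sequential_epoch(G_i, P_i, rng)
```

With this acceptance matrix the agent ends the epoch either with action 0 in state 0 or with the forced last action.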

Note that a typical MDP decision-making algorithm can
be adopted as follows: the decision policy is computed by
using a standard MDP solution method [8] by ignoring the
observed transitions. For example, if the optimal policy is
to select $a_k$ at decision epoch $t$, then the agent would discard
the observed transitions for $a_1, \ldots, a_{k-1}$, and would accept
any observed transition for action $a_k$. Our goal in this paper
is to take advantage of the additional observed transitions to
increase the expected rewards.

Remark. The proposed model is different from the well 
known “secretary problem” in MDP literature [17], [18]. 
In the secretary problem, a fixed number of people are 
interviewed for a job in a sequential manner, and based on 
the (observed) rank of the current interviewed candidates, 
a decision should be taken whether to accept or reject 
the last interviewed candidate. The main difference with 
the sequentially observed MDPs is that in the “secretary 
problem”, observing a candidate changes the probability of 
future transitions (because of the correlation between the 
events). However, in our model an observation is independent 
from the further environmental dynamics (i.e., observing a 
transition at a given phase does not change the transition 
probabilities for next phases or epochs). Another fundamental
difference is that our model does not necessarily have a
stopping time, and the horizon can go to infinity, which is
not possible for the secretary problem. □

III. Defining the MDP 

A. States and Actions 

Let $S = \{1, \ldots, n\}$ be the set of states, with
cardinality $|S| = n$. Let $A_s = \{a_1, \ldots, a_m\}$
be the set of actions available in state $s$ (without loss of
generality the number of actions does not change with the
state, i.e., $|A_s| = m$ for any $s \in S$). We consider a discrete-time
system where actions are taken at different decision
epochs. Let $s(t)$ and $a(t)$ be respectively the state and action
at the $t$-th decision epoch.

B. Decision Rule and Policy 

We define a decision rule $D_t$ at time $t$ to be the following
randomized function

$$D_t : S \to A_s$$

that defines for every state $s \in S$ a random variable
$D_t(s) \in A_s$ with some probability distribution defined
over $\mathcal{P}(A_s)$. In typical MDPs, the decision variables are
directly the probability distribution of this random variable,
$p_i(a,t) = \mathrm{Prob}[D_t = a \mid s(t) = i]$ for any action $a \in A_i$
and given any state $i$. In the sequential MDP, the decision
variables are whether to accept or reject a given transition at
phase $k$. We then define the decision variables as follows:

$$P_i(j,k,t) = \mathrm{Prob}[\text{Accepting observed transition to state } j \mid \text{System is in state } i \text{, phase } k \text{, and epoch } t].$$

Since there are only $m-1$ phases, we assume $P_i(j,k,t) = 1$
if $k = m$. In this new formulation, the order of the actions
is important.

Let

$$\pi = (D_1, D_2, \ldots, D_{N-1})$$

be the policy for the decision-making process given that
there are $N-1$ decision epochs. Then in the typical MDP, the
decision $D_t$ is defined by the independent vector variables
$\{p_1(t), \ldots, p_n(t)\}$, where $p_i(t)$ is the vector having the
probabilities $p_i(a,t) \geq 0$ for all $a \in A_i$ and decision epoch $t$,
and such that $\sum_{a \in A_i} p_i(a,t) = 1$. In the sequential MDP, the
decision $D_t$ is defined by the independent matrix variables
$\{P_1(t), \ldots, P_n(t)\}$, where $P_i(t)$ is the matrix having the
probabilities $P_i(j,k,t) \in [0,1]$ for all destination states
$j \in S$, for $k = 1, \ldots, m$, and decision epoch $t$. For notational
simplicity we will drop the index $t$ when
there is no confusion and denote the variables simply by
$P_i$; the upcoming results are for the time-dependent case. Note
that this decision rule has a Markovian property because it
depends only on the current state. Indeed, this paper considers
only Markovian policies; history-dependent policies [8] are
not considered.

C. Rewards 

Given a state $s \in S$ and action $a \in A_s$, we define the
reward $r_t(s,a) \in \mathbb{R}$ to be any real number and let $\mathcal{R}$ be
the set having these values. With a little abuse of notation,
we define the expected reward for a given decision rule $D_t$
at time $t$ to be

$$r_t(s) = E[r_t(s, D_t(s))] = \sum_{a \in A_s} p_s(a)\, r_t(s,a), \tag{1}$$

and $r_t \in \mathbb{R}^n$ to be the vector with the expected
rewards for each state. Given that there are $N-1$ decision epochs,
there are $N$ reward stages and the final stage reward is
given by $r_N(s)$ (or $r_N$, the vector having as its elements the
final reward at a given state).

D. State Transitions 

We now define the transition probabilities as follows:
$G_i(j,k,t) = \mathrm{Prob}[s(t+1) = j \mid s(t) = i, \text{phase } k]$, and let $G_i(t)$
be the corresponding matrix (for simplicity we will drop
the index $t$ from the notation when there is no confusion
and denote the transitions simply by $G_i$). Let $\mathcal{G}$ be the set
having these transition matrices. Let us define an intermediate
variable $q_i(a_k)$ for notational convenience, which is the
probability of choosing action $a_k$ given that the previous
actions $a_1, \ldots, a_{k-1}$ are rejected:

$$q_i(a_k) = \sum_{j \in S} G_i(j,k)\, P_i(j,k).$$

Then the probability that the agent chooses action $a_k$ is the
probability that the agent rejects the first $k-1$ actions (i.e.,
$\prod_{l=1}^{k-1} (1 - q_i(a_l))$) and then accepts the $k$-th action (i.e.,
$q_i(a_k)$):

$$p_i(a_k) = \left( \prod_{l=1}^{k-1} (1 - q_i(a_l)) \right) q_i(a_k) \quad \text{if } 1 \leq k \leq m, \tag{2}$$

where, by convention, $\prod_{l=1}^{k-1} (1 - q_i(a_l)) = 1$ if $k = 1$. We
observe that $q_i(a_k) = 1$ if $k = m$. The above relation shows
that the decision variables of the typical MDP ($p_i(a_k)$ for
$k = 1, \ldots, m$ and $i = 1, \ldots, n$) are a non-convex function
of the decision variables of the sequential MDP ($P_i$ for
$i = 1, \ldots, n$). The transition probability from a state $i$ to a state
$j$ is given by the probability that phase $k$ is reached and the transition
to state $j$ is accepted, summed over the phases, i.e.,

$$M_t(j,i) = \mathrm{Prob}[s_{t+1} = j \mid s_t = i] = \sum_{k=1}^{m} \left( \prod_{l=1}^{k-1} (1 - q_i(a_l)) \right) G_i(j,k)\, P_i(j,k). \tag{3}$$

Also in this case, the transition probability is not linear in the decision
variables of the sequentially observed MDP. Let $x_i(t) =
\mathrm{Prob}[s_t = i \mid s_1]$ be the probability of being at state $i$ at time
$t$, and $x(t) \in \mathbb{R}^n$ be the vector of these probabilities.
Then the system evolves according to the following recursive
equation:

$$x(t+1) = M_t\, x(t),$$

where $M_t$ (or simply $M$) is the matrix having the elements
$M_t(j,i)$ (or simply $M(j,i)$). It is important to note that the
$i$-th column of $M$ (its transpose is denoted by $M_i^T$) is a
function of the decision variables in the matrix $P_i$ only (i.e.,
independent of the variables of the matrices $P_s$ for $s \neq i$).
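To make Eq. (2) and Eq. (3) concrete, the following sketch computes $q_i(a_k)$, the induced action probabilities $p_i(a_k)$, and the $i$-th column of $M$ for one state. The function name `phase_quantities` and the list layout `G_i[k][j]`, `P_i[k][j]` (phase index first, 0-based, with only the first $m-1$ rows of `P_i` supplied) are assumptions made for illustration.

```python
def phase_quantities(G_i, P_i):
    """Return (q, p, column) for a single state i.

    q[k]      : probability of accepting at phase k, given phase k is reached
    p[k]      : probability that action k is the one finally taken, Eq. (2)
    column[j] : M(j, i), the probability of moving from i to j, Eq. (3)
    """
    m, n = len(G_i), len(G_i[0])
    q, p, column = [], [], [0.0] * n
    reach = 1.0  # probability that phase k is reached (empty product = 1)
    for k in range(m):
        # By definition the last action is always accepted: P_i(j, m) = 1.
        accept = [1.0] * n if k == m - 1 else P_i[k]
        q_k = sum(G_i[k][j] * accept[j] for j in range(n))
        q.append(q_k)
        p.append(reach * q_k)
        for j in range(n):
            column[j] += reach * G_i[k][j] * accept[j]
        reach *= 1.0 - q_k
    return q, p, column


# Usage: two actions, two states; accept action 0 only when it leads to state 0.
q, p, column = phase_quantities([[0.5, 0.5], [0.2, 0.8]], [[1.0, 0.0]])
```

In this small instance both the action probabilities and the column of $M$ sum to one, as they must for a valid decision rule.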

E. Markov Decision Processes (MDPs) 

Let $\gamma \in [0,1]$ be the discount factor, which represents
the importance of a current reward in comparison to future
possible rewards. We will consider $\gamma = 1$ throughout the
paper, but the results are not affected and remain applicable
after a suitable scaling when $\gamma < 1$.

A discrete MDP is a 5-tuple $(S, A_s, \mathcal{G}, \mathcal{R}, \gamma)$ where $S$ is
a finite set of states, $A_s$ is a finite set of actions available for
state $s$, $\mathcal{G}$ is the set that contains the transition probabilities
given the current state and current action, and $\mathcal{R}$ is the set
of rewards at a given time epoch due to the current state and
action.


F. Performance Metric 

For a policy to be better than another policy we need
to define a performance metric. We will use the expected
discounted total reward for our performance study:

$$v_N^{\pi} = E_{x(1)}\left[ \sum_{t=1}^{N-1} r_t(X_t, D_t(X_t)) + r_N(X_N) \right],$$

where $X_t$ is the state at decision epoch $t$ and the expectation
is conditioned on a probability distribution over the initial
states (i.e., $x(1) \in \mathcal{P}(S)$ where $x_i(1) = \mathrm{Prob}[s_1 = i]$). It is
worth noting that both $X_t$ and $D_t(X_t)$ are random variables
in the above expression.


G. Optimal Markovian Policy 

The optimal policy $\pi^*$ is given as the policy that maximizes
the performance measure, $\pi^* = \arg\max_{\pi} v_N^{\pi}$, and we let
$v_N^*$ be the optimal value, i.e., $v_N^* = \max_{\pi} v_N^{\pi}$. Note
that the optimization variables of the above maximization
are $P_1(t), \ldots, P_n(t)$ for $t = 1, \ldots, N-1$.^1 For the typical
MDP, the backward induction algorithm [8, p. 92] gives the
optimal policy as well as the optimal value. However, in
our new model the optimization variables are different and
another algorithm for finding optimal policies is needed. In
the following sections, we will give such an algorithm for
the sequential MDP (SMDP) and we will show its optimality
using the Bellman equations of dynamic programming.
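For reference, the standard (non-sequential) backward induction mentioned above can be sketched in a few lines. This is a generic textbook-style sketch rather than the exact algorithm of [8]; the function name and the data layout (`G[i][k][j]` time-invariant transition probabilities, `r[t][i][k]` stage rewards, `r_final[i]` terminal rewards) are assumptions chosen for illustration.

```python
def standard_backward_induction(G, r, r_final):
    """Finite-horizon backward induction for a standard MDP.

    G[i][k][j] : probability of moving from state i to j under action k
    r[t][i][k] : reward for taking action k in state i at epoch t
    r_final[i] : terminal reward in state i
    Returns (V, policy): V[i] is the optimal value at the first epoch and
    policy[t][i] is a maximizing action index at epoch t.
    """
    n = len(r_final)
    V = list(r_final)
    policy = []
    for r_t in reversed(r):  # epochs N-1, ..., 1
        new_V, rule = [], []
        for i in range(n):
            # Expected reward-to-go of each action, ignoring any observation.
            scores = [
                r_t[i][k] + sum(G[i][k][j] * V[j] for j in range(n))
                for k in range(len(G[i]))
            ]
            best = max(range(len(scores)), key=scores.__getitem__)
            rule.append(best)
            new_V.append(scores[best])
        V = new_V
        policy.insert(0, rule)
    return V, policy


# Usage: two states, two deterministic actions ("go to 0", "go to 1"), one epoch.
G = [[[1, 0], [0, 1]], [[1, 0], [0, 1]]]
V, policy = standard_backward_induction(G, [[[0, 0], [0, 0]]], [1.0, 5.0])
```

Here both states prefer the action leading to state 1, whose terminal reward is larger.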


IV. Dynamic Programming (DP) Approach for 
MDPs 

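In this section, we transform the MDP problem into a
deterministic Dynamic Programming (DP) problem and use
this approach to devise an efficient algorithm for finding
optimal policies of the newly introduced model. First note that
the performance metric can be written as follows:

$$\begin{aligned}
v_N^{\pi} &= E_{x(1)}\left[ \sum_{t=1}^{N-1} r_t(X_t, D_t(X_t)) + r_N(X_N) \right] \\
&= \sum_{t=1}^{N-1} E_{x(1)}\left[ r_t(X_t, D_t(X_t)) \right] + E_{x(1)}\left[ r_N(X_N) \right] \\
&= \sum_{t=1}^{N-1} E_{x(1)}\left[ E\left[ r_t(X_t, D_t(X_t)) \mid X_t \right] \right] + E_{x(1)}\left[ r_N(X_N) \right] \\
&= \sum_{t=1}^{N} E_{x(1)}\left[ e_{X_t}^T r_t \right] = \sum_{t=1}^{N} x(t)^T r_t,
\end{aligned}$$

where $e_s$ is the vector of all zeros except a value 1 at the
position $s$. The last equality utilized the fact that $E_{x(1)}[e_{X_t}] =
x(t)$.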
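The resulting expression, $\sum_{t=1}^{N} x(t)^T r_t$ with $x(t+1) = M_t x(t)$, can be evaluated directly once a policy has fixed the matrices $M_t$ and the expected reward vectors $r_t$. The sketch below assumes those quantities are already available as plain lists; the function name `policy_value` and the layout `M_list[t][j][i]` (column-stochastic, destination index first) are illustrative choices.

```python
def policy_value(x1, M_list, r_list):
    """Evaluate sum_t x(t)^T r_t by propagating the state distribution.

    x1        : initial distribution over the n states
    M_list[t] : matrix with entries M[j][i] = Prob[next = j | current = i]
    r_list    : N expected-reward vectors; the last one is the final reward
    """
    x = list(x1)
    n = len(x)
    total = 0.0
    for t, r_t in enumerate(r_list):
        total += sum(x_i * r_i for x_i, r_i in zip(x, r_t))
        if t < len(M_list):
            M = M_list[t]
            # x(t+1) = M_t x(t)
            x = [sum(M[j][i] * x[i] for i in range(n)) for j in range(n)]
    return total


# Usage: start in state 0, one decision epoch followed by the final reward.
value = policy_value([1.0, 0.0], [[[0.6, 0.0], [0.4, 1.0]]], [[1.0, 2.0], [10.0, 20.0]])
```

In this instance the first stage contributes 1 and the final stage contributes 0.6 * 10 + 0.4 * 20 = 14.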

We can now give the DP formulation. For notational simplicity,
let $x_t = x(t)$. The discrete-time dynamical system
describing the evolution of the density $x_t$ can then be given
by

$$x_{t+1} = f_t(x_t, P_1(t), \ldots, P_n(t)) \quad \text{for } t = 1, \ldots, N-1,$$

such that $f_t(x_t, P_1(t), \ldots, P_n(t)) = M_t x_t$, where $M_t =
M_t(P_1(t), \ldots, P_n(t))$ is the transition matrix, a function of
the optimization variables. The elements of the $i$-th column
of $M_t$ are functions of only the elements of the matrix $P_i(t)$,
as mentioned earlier. The above dynamics show that the
probability distribution evolves deterministically. Our policy
$\pi = (D_1, \ldots, D_{N-1})$ consists of a sequence of functions
that map states $x_t$ into controls $P_i(t)$ for all $i$ in
such a way that the controls belong to $C(x_t)$, where $C(x_t)$ is the set of
constraints on the control. Since the only constraints on the
decision variables are that they are restricted to the interval
$[0,1]$, $C(x_t)$ is independent of $x_t$ and all admissible
controls belong to the same convex set $C$ for any given state.

The additive reward per stage is defined as $g_N(x_N) =
x_N^T r_N$ and

$$g_t(x_t, P_1(t), \ldots, P_n(t)) = x_t^T r_t \quad \text{for } t = 1, \ldots, N-1.$$

Dynamic programming then calculates the optimal value
$v_N^*$ (and policy $\pi^*$) by running Algorithm 1 [19, Proposition
1.3.1, p. 23].

^1 Since $v_N^{\pi}$ is continuous in the decision variables, which belong to a closed
and bounded set, the max is always attained and the argmax is well
defined.


Algorithm 1 Dynamic Programming
1: Start with $J_N(x) = g_N(x)$

2: for $t = N-1, \ldots, 1$

   $J_t(x) = \max_{P_1(t), \ldots, P_n(t) \in C} \left\{ g_t(x, P_1(t), \ldots, P_n(t)) + J_{t+1}(f_t(x, P_1(t), \ldots, P_n(t))) \right\}$.

3: Result: $J_1(x) = v_N^*$


Remark. There are several difficulties in applying the DP
Algorithm 1. Note that in the term $J_{t+1}(f_t(x, P_1, \ldots, P_n))$
used in the algorithm, the $P_i$'s are the optimization variables. For
a given $P_i$ and $x$, numerical methods can be used to compute
the value of $J_{t+1}$. But since $P_i$ itself is an optimization
variable, the solution of the optimization problem in line
2 of Algorithm 1 can be very hard. In some special cases,
for example when $J_t(x)$ can be expressed analytically in
a closed form, the solution complexity can be reduced
significantly, as we will show next for the sequential MDP
problems. □

A. Backward Induction for the sequential MDP model

This section presents the optimal backward induction algorithm
for solving the sequential MDP by using the dynamic
programming approach. The set of admissible controls at
time $t$ is given by $C(x_t) = C$, defined as follows:

$$0 \leq P_i(j,k,t) \leq 1 \quad \text{for all } i \in S,\ j \in S,\ a_k \in A_i.$$

Using the dynamic programming Algorithm 1, we can now
give the following proposition:

Proposition 1. The term $J_t(x)$ in the dynamic programming
algorithm for the sequential MDP has the following closed-form
solution:

$$J_t(x) = x^T V_t^*,$$

where $V_t^*$ is a vector that satisfies the following recursion:
$V_N^* = r_N$ and for $t = N-1, \ldots, 1$ we have

$$V_t^*(i) = \max_{P_i(t) \in C} \left\{ r_t(i) + M_i^T V_{t+1}^* \right\} \quad \text{for } i = 1, \ldots, n.$$

Proof. We will show this by induction. From the definition
of $g_N(\cdot)$ we have the base case satisfied (i.e., $J_N(x) =
x^T r_N = x^T V_N^*$). Suppose the hypothesis is true for $N-1, \ldots, t+1$;
then we show it is true for $t$. From the DP
algorithm, we can write

\begin{align}
J_t(x) &= \max_{P_1(t), \ldots, P_n(t) \in C} \left\{ x^T r_t + J_{t+1}(M_t x) \right\} \tag{4} \\
&= \max_{P_1(t), \ldots, P_n(t) \in C} \left\{ x^T r_t + x^T M_t^T V_{t+1}^* \right\} \tag{5} \\
&= \max_{P_1(t), \ldots, P_n(t) \in C} \sum_{i=1}^{n} x_i \left( r_t(i) + M_i^T V_{t+1}^* \right) \tag{6} \\
&= \sum_{i=1}^{n} x_i \left( \max_{P_i(t) \in C} \left\{ r_t(i) + M_i^T V_{t+1}^* \right\} \right), \tag{7}
\end{align}

where $M_i^T$ indicates the transpose of the $i$-th column of
$M$, which is a function of the decision variables of the $P_i$
matrix only. The transition from (4) to (5) is due to the
induction assumption, and the transition from (6) to (7) is
because $x_i \geq 0$ for all $i$ and the function is separable
in terms of the optimization variables. The maximization
inside the parentheses is nothing but $V_t^*(i)$, so $J_t(x) =
x^T V_t^*$. This ends the proof. □

Notice that $J_t(x)$ has a closed-form expression as a function
of $x$, so it suffices to calculate $V_t^*$ for $t = N, \ldots, 1$
to find the optimal value of the MDP, given by $v_N^* =
J_1(x_1) = x_1^T V_1^*$. The backward induction algorithm is given
in Algorithm 2.

Algorithm 2 Backward Induction: Sequential MDP Optimal
Policy

1: Definitions: For any state $s \in S$, we define
   $V_t^{\pi}(s) = E\left[ \sum_{k=t}^{N-1} r_k(X_k, D_k(X_k)) + r_N(X_N) \right]$
   and $V_t^*(s) = \max_{\pi} V_t^{\pi}(s)$, given that $s_t = s$.

2: Start with $V_N^*(s) = r_N(s)$

3: for $t = N-1, \ldots, 1$, given $V_{t+1}^*$, and for $s = 1, \ldots, n$,
   calculate the optimal value

   $V_t^*(s) = \max_{P_s(t) \in C} \left\{ r_t(s) + \sum_{j \in S} M_t(j,s)\, V_{t+1}^*(j) \right\}$

   and the optimal policy $P_s^*(t)$ given by:

   $P_s^*(t) = \arg\max_{P_s(t) \in C} \left\{ r_t(s) + \sum_{j \in S} M_t(j,s)\, V_{t+1}^*(j) \right\}$

4: Result: $V_1^*(s_1) = v_N^*$, where $s_1$ is the initial state.










Remark: We want to stress two points about the algorithm.
First, the policy calculated by Algorithm 2 is
optimal (maximizing the total expected reward) because of
line 3 in Algorithm 1 and Proposition 1. Second, $r_t(s)$ and
$M_t(j,s)$ are both functions of the decision variables in $P_s$. In
typical MDPs, these values are simply linear in the decision
variables. However, in the proposed sequential MDP model,
these values are non-convex in the decision variables and
further processing is needed for efficient implementation of
the algorithm, which is discussed next.

B. Efficient Implementation of Algorithm 2

In the internal loop of Algorithm 2, the optimal value at
a given decision epoch $t$ is given by the following equation:

$$V_t^*(i) = \max_{P_i(t) \in C} V_t(i), \tag{8}$$

where $V_t(i) = r_t(i) + \sum_{j \in S} M_t(j,i)\, V_{t+1}^*(j)$. In this formulation,
$r_t(i)$ and $M_t(j,i)$ are functions of the decision
variable $P_i(t)$, for given state $i$ and time epoch $t$. In particular,
the explicit expressions can be deduced from Eq. (1),
Eq. (2), and Eq. (3) as follows:

$$r_t(i) = \sum_{a \in A_i} p_i(a)\, r_t(i,a) = \sum_{k=1}^{m} \left( \prod_{l=1}^{k-1} (1 - q_i(a_l)) \right) q_i(a_k)\, r_t(i, a_k),$$

and

$$M_t(j,i) = \sum_{k=1}^{m} \left( \prod_{l=1}^{k-1} (1 - q_i(a_l)) \right) G_i(j,k)\, P_i(j,k), \tag{9}$$

where $q_i(a_k) = \sum_{j \in S} G_i(j,k)\, P_i(j,k)$. By substituting these
equations in the expression of $V_t(i)$, we obtain

\begin{align}
V_t(i) &= r_t(i) + \sum_{j \in S} M_t(j,i)\, V_{t+1}^*(j) \tag{10} \\
&= \sum_{k=1}^{m} \left( \prod_{l=1}^{k-1} (1 - q_i(a_l)) \right) \left( \sum_{j=1}^{n} G_i(j,k)\, P_i(j,k) \right) r_t(i, a_k) \nonumber \\
&\quad + \sum_{j=1}^{n} \left( \sum_{k=1}^{m} \left( \prod_{l=1}^{k-1} (1 - q_i(a_l)) \right) G_i(j,k)\, P_i(j,k) \right) V_{t+1}^*(j) \tag{11} \\
&= \sum_{k=1}^{m} \sum_{j=1}^{n} \left( r_t(i, a_k) + V_{t+1}^*(j) \right) G_i(j,k)\, X_i(j,k) \tag{12} \\
&= \sum_{k=1}^{m} \sum_{j=1}^{n} H_i(j,k)\, X_i(j,k), \tag{13}
\end{align}

where

$$X_i(j,k) := \left( \prod_{l=1}^{k-1} (1 - q_i(a_l)) \right) P_i(j,k), \tag{14}$$

$$H_i(j,k) := \left( r_t(i, a_k) + V_{t+1}^*(j) \right) G_i(j,k).$$

Note that $H_i(j,k)$ is independent of the decision variables.

For efficient implementation of the algorithm, it remains
to show what conditions $X_i(j,k)$ should satisfy so that
the mapping $X_i(j,k) = \left( \prod_{l=1}^{k-1} (1 - q_i(a_l)) \right) P_i(j,k)$ is
invertible. Notice that if $q_i(a_l) \neq 1$ for $l = 1, \ldots, m-1$,
then the mapping is one-to-one and we will give
the expression for $P_i$ in terms of $X_i$ shortly. If there
exists $l$ such that $q_i(a_l) = 1$, then the phases $k > l_{min}$
are not reached because an earlier action must necessarily
be accepted, where $l_{min} = \min\{ l \mid q_i(a_l) = 1 \}$. This means
that $V_t(i)$ is independent of $P_i(j,k)$ when $k > l_{min}$ (i.e.,
the optimal value is not affected by these variables) and
without loss of generality we can consider $P_i(j,k) = 1$ for
$j = 1, \ldots, n$ and $k = l_{min} + 1, \ldots, m$.

We can now give the expression of $P_i$ in terms of $X_i$ by
the following lemma:

Lemma 1. For a given state $i$, the following equation holds
for $X_i(j,k)$, $j = 1, \ldots, n$ and $k = 1, \ldots, m$, in Eq. (14):

$$X_i(j,k) = \left( 1 - \sum_{l=1}^{k-1} \sum_{s=1}^{n} G_i(s,l)\, X_i(s,l) \right) P_i(j,k). \tag{15}$$

Proof. We will prove this lemma by showing that
$\prod_{l=1}^{k-1} (1 - q_i(a_l)) = 1 - \sum_{l=1}^{k-1} \sum_{s=1}^{n} G_i(s,l)\, X_i(s,l)$ by
induction. It is true for $k = 2$ by the definition of $q_i(a_1)$.
Suppose it is true for the product up to $k-2$, and let us show it is true
for the product up to $k-1$. We have

$$\begin{aligned}
\prod_{l=1}^{k-1} (1 - q_i(a_l)) &= \left( \prod_{l=1}^{k-2} (1 - q_i(a_l)) \right) (1 - q_i(a_{k-1})) \\
&= \prod_{l=1}^{k-2} (1 - q_i(a_l)) - \sum_{s=1}^{n} G_i(s, k-1)\, X_i(s, k-1) \\
&= 1 - \sum_{l=1}^{k-1} \sum_{s=1}^{n} G_i(s,l)\, X_i(s,l),
\end{aligned}$$

where the last equality uses the induction hypothesis. □

It remains to derive the constraints on $X_i(j,k)$ when $P_i \in
C$. Since $P_i(j,k) \in [0,1]$ for all $j = 1, \ldots, n$ and $k =
1, \ldots, m-1$, we can derive the following conditions:


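$$0 \leq X_i(j,k) \leq 1 - \sum_{l=1}^{k-1} \sum_{s=1}^{n} G_i(s,l)\, X_i(s,l),$$

and since by definition $P_i(j,m) = 1$ for all $j = 1, \ldots, n$:

$$X_i(j,m) = 1 - \sum_{l=1}^{m-1} \sum_{s=1}^{n} G_i(s,l)\, X_i(s,l).$$

As a result, $V_t^*(i)$ is the solution of the following linear
program:

\begin{align}
\text{maximize} \quad & \sum_{k=1}^{m} \sum_{j=1}^{n} H_i(j,k)\, X_i(j,k) \nonumber \\
\text{subject to} \quad & \text{for } j = 1, \ldots, n \text{ and } k = 1, \ldots, m-1: \nonumber \\
& 0 \leq X_i(j,k) \leq 1 - \sum_{l=1}^{k-1} \sum_{s=1}^{n} G_i(s,l)\, X_i(s,l), \nonumber \\
& X_i(j,m) = 1 - \sum_{l=1}^{m-1} \sum_{s=1}^{n} G_i(s,l)\, X_i(s,l). \tag{16}
\end{align}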
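The change of variables and the constraints above can be checked numerically for a single state. The sketch below maps an acceptance matrix $P_i$ to $X_i$ via Eq. (14), computes the right-hand sides of the constraints, evaluates the linear objective, and inverts the mapping via Eq. (15). It does not solve the linear program; the function names and the list layout `G_i[k][j]` (phase index first, 0-based) are assumptions made for illustration.

```python
def to_X(G_i, P_i):
    """Eq. (14): X_i(j,k) = prod_{l<k} (1 - q_i(a_l)) * P_i(j,k), with P_i(j,m) = 1."""
    m, n = len(G_i), len(G_i[0])
    X, reach = [], 1.0
    for k in range(m):
        accept = [1.0] * n if k == m - 1 else P_i[k]
        X.append([reach * a for a in accept])
        reach *= 1.0 - sum(G_i[k][j] * accept[j] for j in range(n))
    return X


def upper_bounds(G_i, X):
    """Per-phase bound 1 - sum_{l<k} sum_s G_i(s,l) X_i(s,l)."""
    bounds, used = [], 0.0
    for k in range(len(G_i)):
        bounds.append(1.0 - used)
        used += sum(g * x for g, x in zip(G_i[k], X[k]))
    return bounds


def objective(G_i, X, r_i, V_next):
    """Linear objective sum_k sum_j (r_t(i,a_k) + V*_{t+1}(j)) G_i(j,k) X_i(j,k)."""
    return sum(
        (r_i[k] + V_next[j]) * G_i[k][j] * X[k][j]
        for k in range(len(G_i))
        for j in range(len(V_next))
    )


def recover_P(G_i, X):
    """Invert Eq. (15); unreachable phases get P = 1 by convention."""
    bounds = upper_bounds(G_i, X)
    return [[x / b if b > 0 else 1.0 for x in row] for row, b in zip(X, bounds)]


# Usage: two actions, two states; accept action 0 only when it leads to state 0.
G_i = [[0.5, 0.5], [0.2, 0.8]]
X = to_X(G_i, [[1.0, 0.0]])
value = objective(G_i, X, [0.0, 0.0], [10.0, 2.0])
```

In this instance the objective equals 0.5 * 10 + 0.5 * (0.2 * 10 + 0.8 * 2), which is the expected value obtained by following the acceptance rule directly.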

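To write it in matrix form, let $1_n$ be the vector of all ones
of dimension $n$, $J = 1_n 1_n^T$, and $B$ be a constant $m \times m$
matrix defined as $B(l,k) = 1$ if $k > l$ and $B(l,k) = 0$
otherwise. Here $\odot$ denotes the element-wise product and $e_m$
is the vector of all zeros except a value 1 at position $m$.

Lemma 2. The linear program (16) can be written in matrix
form as follows:

\begin{align}
\text{maximize} \quad & \mathrm{Tr}(H_i^T X_i) \nonumber \\
\text{subject to} \quad & 0 \leq X_i, \quad X_i + J (G_i \odot X_i) B \leq 1_n 1_m^T, \nonumber \\
& \left( X_i + J (G_i \odot X_i) B \right) e_m = 1_n. \tag{17}
\end{align}

Let $z(k) = 1 - \sum_{l=1}^{k-1} \sum_{s=1}^{n} G_i(s,l)\, X_i^*(s,l)$ if $k =
2, \ldots, m$ and $z(1) = 1$. The following proposition summarizes
our results.

Proposition 2. For a given decision epoch $t$ and state $i$, the
optimal value and optimal policy terms in Algorithm 2 are
given by

$$V_t^*(i) = \mathrm{Tr}(H_i^T X_i^*),$$

and for $j = 1, \ldots, n$ and $k = 1, \ldots, m$

$$P_i^*(j,k,t) = \begin{cases} X_i^*(j,k) / z(k) & \text{if } z(k) > 0, \\ 1 & \text{else,} \end{cases} \tag{18}$$

where $X_i^*$ is the solution of the linear program (17).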
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Fig. 2. The figure shows the difference in the utility (optimal value) of the 
sequential MDP strategy that takes advantage of the observed transitions 
and the standard MDP that does not use this extra information. The figure 
shows that the difference in the utility depends on the initial position of 
agents. Some bins can give a higher than expected reward than other bins. 


Proof. The proof is based on the fact that the linear program 
in the decision variables $X_i$ is equivalent to the original 
optimization over the $P_i$ variables, because the mapping 
between the two sets of variables is one-to-one once the 
additional (redundant) constraints are imposed: $P_i(j, k) = 1$ 
for $j = 1, \ldots, n$ and $k = l_{\min} + \ldots$ 

V. Simulations 

This section presents a simulation example to demonstrate 
the proposed policy synthesis method for MDPs with 
sequentially observed transitions. In this application, 
autonomous vehicles (agents) explore a region $F$, which can 
be partitioned into $n$ disjoint subregions (or bins) $F_i$ for 
$i = 1, \ldots, n$ such that $F = \bigcup_i F_i$ [20], [21]. We 
can model the system as an MDP where the states of the agents 
are their bin locations and the actions of a vehicle are defined 
by the possible transitions to neighboring bins. Each vehicle 
collects rewards while traversing the area where, due to 
the stochastic environment, transitions are stochastic (i.e., 
even if the vehicle is commanded to move “right”, the 
environment can send it “left”). In particular, with 
probability 0.6 the given command leads to the desired bin, 
while with probability 0.4 the agent lands on another 
neighboring bin. We assume a region described by a 10 by 10 
grid. Each vehicle has 5 possible actions: “up”, “down”, 
“left”, “right”, and “stay”. When the vehicle is on the 
boundary, we set the probability of actions that cause a 
transition outside of the domain to zero. There are 100 states 
and 5 actions in total, and the decision time horizon is 
$N = 10$. The reward vectors $R_t$ for $t = 1, \ldots, N-1$ 
and $R_N$ are chosen randomly with entries in the interval 
[0, 100]. Since any feasible policy for a standard MDP is 
also a feasible solution for the proposed sequential model 
(i.e., $\pi_{\mathrm{MDP}} \subseteq \pi_{\mathrm{SMDP}}$), the 
following holds: 

$$ v_N^{*\,\mathrm{MDP}} \le v_N^{*\,\mathrm{SMDP}} $$ 

Figure 2 shows the difference in values due to the optimal 
policies of the standard MDP model and the proposed 
sequential MDP (i.e., $v_N^{*\,\mathrm{SMDP}} - v_N^{*\,\mathrm{MDP}}$). 
The figure shows that, depending on the initial state, the new 
model can achieve a significant improvement by utilizing the 
additional information (observing the transitions before 
deciding on actions). 
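
The setup described in this section can be sketched numerically. The following is a minimal illustration, not the authors' implementation: it builds the 10 by 10 grid MDP and computes the standard finite-horizon optimal values by backward induction. For comparison it also evaluates a simplified accept/reject reading of "observing the transitions before deciding on actions". That reading is an assumption made only for this sketch: actions are examined in a fixed order, the realized next bin of each is revealed, and the last one must be accepted. Also assumed are the even split of the 0.4 probability mass, the treatment of off-grid commands as unavailable, the per-state reward convention, and all function names.

```python
import numpy as np

SIDE, N = 10, 10                                     # 10-by-10 grid, horizon N = 10
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]   # up, down, left, right, stay
n, p = SIDE * SIDE, len(MOVES)                       # 100 states, 5 actions


def build_transitions(p_success=0.6):
    """Return P[a, s, :] and a boolean mask feasible[s, a] of in-grid commands."""
    P = np.zeros((p, n, n))
    feasible = np.zeros((n, p), dtype=bool)
    for s in range(n):
        r, c = divmod(s, SIDE)
        reach = {}                                   # action -> in-grid target bin
        for a, (dr, dc) in enumerate(MOVES):
            rr, cc = r + dr, c + dc
            if 0 <= rr < SIDE and 0 <= cc < SIDE:
                reach[a] = rr * SIDE + cc
        for a, target in reach.items():
            # ASSUMPTION: commands pointing off the grid are simply unavailable.
            feasible[s, a] = True
            others = [j for j in reach.values() if j != target]
            P[a, s, target] = p_success
            # ASSUMPTION: the remaining mass is split evenly over the other
            # reachable bins (the current bin and its in-grid neighbors).
            P[a, s, others] = (1.0 - p_success) / len(others)
    return P, feasible


def standard_values(P, feasible, R):
    """Backward induction for the standard MDP; R[t, s] is the reward in bin s."""
    V = R[-1].copy()
    for t in range(N - 2, -1, -1):
        Q = np.einsum("asj,j->sa", P, V)             # expected continuation value
        Q[~feasible] = -np.inf                       # exclude off-grid commands
        V = R[t] + Q.max(axis=1)
    return V


def sequential_values(P, feasible, R):
    """ASSUMED accept/reject model: outcomes are revealed one action at a time."""
    V = R[-1].copy()
    for t in range(N - 2, -1, -1):
        V_new = np.empty(n)
        for s in range(n):
            acts = np.flatnonzero(feasible[s])
            U = P[acts[-1], s] @ V                   # last action must be accepted
            for a in acts[-2::-1]:
                # Accept the revealed bin j only if V[j] beats continuing (U).
                U = P[a, s] @ np.maximum(V, U)
            V_new[s] = R[t, s] + U
        V = V_new
    return V


rng = np.random.default_rng(0)
R = rng.uniform(0.0, 100.0, size=(N, n))             # rewards drawn from [0, 100]
P, feasible = build_transitions()
v_mdp = standard_values(P, feasible, R)
v_smdp = sequential_values(P, feasible, R)
gain = (v_smdp - v_mdp).reshape(SIDE, SIDE)          # per-bin gain, as in Fig. 2
print("min gain:", gain.min(), "max gain:", gain.max())
```

Under these assumptions the gain is nonnegative in every bin, because rejecting no outcome of a fixed action reproduces a standard MDP policy. This mirrors the feasibility argument above, although the numbers will not match the paper's figure.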

VI. Conclusion 

This paper introduces a novel model for MDPs that 
incorporates additional observations on the transitions for a 
given action in a sequential manner. This model achieves 
better expected total rewards than the optimal policies for 
the standard MDP models studied in the literature due to 
the utilization of additional information. We also propose an 
efficient algorithm based on linear programming that allows 
offline calculations of these optimal policies. 





References 


[1] D. C. Parkes and S. Singh, “An MDP-based approach to Online 
Mechanism Design,” in Proc. 17th Annual Conf. on Neural Information 
Processing Systems (NIPS’03), 2003. 

[2] D. A. Dolgov and E. H. Durfee, “Resource allocation among agents 
with MDP-induced preferences,” Journal of Artificial Intelligence 
Research (JAIR-06), vol. 27, pp. 505-549, December 2006. 

[3] P. Doshi, R. Goodwin, R. Akkiraju, and K. Verma, “Dynamic 
workflow composition using Markov decision processes,” in Web Services, 
2004. Proceedings. IEEE International Conference on, July 2004, pp. 
576-582. 

[4] R. S. Sutton and A. G. Barto, Introduction to reinforcement learning. 
MIT Press, 1998. 

[5] C. Szepesvari, “Algorithms for reinforcement learning,” Synthesis 
Lectures on Artificial Intelligence and Machine Learning, vol. 4, no. 1, 
pp. 1-103, 2010. 

[6] E. Feinberg and A. Shwartz, Handbook of Markov Decision Processes: 
Methods and Applications, ser. International Series in Operations 
Research & Management Science. Springer US, 2002. 

[7] E. Altman, “Applications of Markov Decision Processes in 
Communication Networks: a Survey,” INRIA, Research Report RR-3984, 2000. 

[8] M. L. Puterman, Markov Decision Processes: Discrete Stochastic 
Dynamic Programming, ser. Wiley series in probability and mathematical 
statistics. New York: John Wiley & Sons, 1994, a Wiley-Interscience 
publication. 

[9] R. Bellman, Dynamic Programming, 1st ed. Princeton, NJ, USA: 
Princeton University Press, 1957. 

[10] R. A. Howard, Dynamic Programming and Markov Processes. 
Cambridge, MA: MIT Press, 1960. 

[11] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, “Planning and 
acting in partially observable stochastic domains,” Artificial 
Intelligence, vol. 101, no. 1-2, pp. 99-134, 1998. 

[12] L. P. Kaelbling, M. L. Littman, and A. P. Moore, “Reinforcement 
learning: A survey,” Journal of Artificial Intelligence Research, vol. 4, 
pp. 237-285, 1996. 

[13] X. Guo and O. Hernández-Lerma, “Continuous-time Markov 
decision processes,” in Continuous-Time Markov Decision Processes, ser. 
Stochastic Modelling and Applied Probability. Springer Berlin 
Heidelberg, 2009, vol. 62, pp. 9-18. 

[14] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the 
multiarmed bandit problem,” Machine Learning, vol. 47, no. 2-3, pp. 
235-256, 2002. 

[15] M. H. DeGroot, Optimal Statistical Decisions. Hoboken, NJ: John 
Wiley & Sons, 2004. 

[16] E. Altman, Constrained Markov Decision Processes, ser. Stochastic 
Modeling Series. Taylor & Francis, 1999. 

[17] T. S. Ferguson, “Who solved the secretary problem?” Statist. Sci., 
vol. 4, no. 3, pp. 282-289, Aug. 1989. 

[18] M. Babaioff, N. Immorlica, D. Kempe, and R. Kleinberg, “A knapsack 
secretary problem with applications,” in Approximation, 
Randomization, and Combinatorial Optimization. Algorithms and Techniques, ser. 
Lecture Notes in Computer Science. Springer Berlin Heidelberg, 
2007, vol. 4627, pp. 16-28. 

[19] D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol.I, 
3rd ed. Athena Scientific, 2005. 

[20] B. Açıkmeşe and D. Bayard, “A Markov chain approach to 
probabilistic swarm guidance,” in American Control Conference (ACC), 2012, 
June 2012, pp. 6300-6307. 

[21] B. Açıkmeşe, N. Demir, and M. Harris, “Convex necessary and 
sufficient conditions for density safety constraints in Markov chain 
synthesis,” in press, IEEE Trans. on Automatic Control, 2015. 



